Florida State University Libraries
2015
PyQuery: A Search Engine for Python Packages and Modules
Shiva Krishna Imminni
FLORIDA STATE UNIVERSITY
COLLEGE OF ARTS AND SCIENCES
PYQUERY:
A SEARCH ENGINE FOR PYTHON PACKAGES AND MODULES
By
SHIVA KRISHNA IMMINNI
A Thesis submitted to the
Department of Computer Science
in partial fulfillment of the
requirements for the degree of
Master of Science
2015
Copyright © 2015 Shiva Krishna Imminni. All Rights Reserved.
Shiva Krishna Imminni defended this thesis on November 13, 2015.
The members of the supervisory committee were:
Piyush Kumar
Professor Directing Thesis
Sonia Haiduc
Committee Member
Margareta Ackerman
Committee Member
The Graduate School has verified and approved the above-named committee members, and certifies
that the thesis has been approved in accordance with university requirements.
I dedicate this thesis to my family. I am grateful to my loving parents, Nageswara Rao and Subba
Laxmi, who made me the person I am today. I am thankful to my affectionate sister, Ramya
Krishna, who is very special to me and always stood by my side.
ACKNOWLEDGMENTS
I owe thanks to many people. Firstly, I would like to express my gratitude to Dr. Piyush Kumar for
directing my thesis. Without his continuous support, patience, guidance and immense knowledge,
PyQuery wouldn't be so successful. He truly made a difference in my life by introducing me to the
Python programming language and helping me learn how to contribute to the Python community. He
trusted me and remained patient during the difficult times. I would also like to thank Dr. Sonia
Haiduc and Dr. Margareta Ackerman for participating on the committee, monitoring my progress
and providing insightful comments. They helped me learn multiple perspectives that widened
my research. I would like to thank my team members Mir Anamul Hasan, Michael Duckett,
Puneet Sachdeva and Sudipta Karmakar for their time, support, commitment and contributions to
PyQuery.
TABLE OF CONTENTS

List of Tables
List of Figures
Abstract
1 Introduction
    1.1 Objective
    1.2 Approach
2 Related Work
3 Data Collection
    3.1 Package Level Search
        3.1.1 Metadata - Packages
        3.1.2 Code Quality
    3.2 Module Level Search
        3.2.1 Mirror Python Packages
        3.2.2 Metadata - Modules
        3.2.3 Code Quality
4 Data Indexing and Searching
    4.1 Data Indexing
        4.1.1 Package Level Search
        4.1.2 Module Level Search
    4.2 Data Searching
        4.2.1 Package Level Search
        4.2.2 Module Level Search
5 Data Presentation
    5.1 Server Setup
    5.2 Browser Interface
    5.3 Package Level Search
        5.3.1 Ranking of Packages
    5.4 Module Level Search
        5.4.1 Preprocessing
6 System Level Flow Diagram
7 Results
8 Conclusions and Future Work
    8.1 Thesis Summary
    8.2 Recommendation for Future Work
Bibliography
Biographical Sketch
LIST OF TABLES

5.1 Ranking matrix for keyword music.
5.2 Matching packages and their scores for keyword music.
7.1 Results comparison for keyword - requests.
7.2 Results comparison for keyword - flask.
7.3 Results comparison for keyword - pygments.
7.4 Results comparison for keyword - Django.
7.5 Results comparison for keyword - pylint.
7.6 Results comparison for keyword - biological computing.
7.7 Results comparison for keyword - 3D printing.
7.8 Results comparison for keyword - web development framework.
7.9 Results comparison for keyword - material science.
7.10 Results comparison for keyword - google maps.
LIST OF FIGURES

3.1 Metadata from PyPI for package requests.
3.2 Pseudocode for collecting metadata of a module.
3.3 Metadata for a module in Flask package.
4.1 Package level data index mapping.
4.2 River definition.
4.3 Custom Analyzer with custom pattern filter.
4.4 Module level data indexing mapping.
4.5 Package level search query.
4.6 Module level search query.
5.1 Package modal.
5.2 Package statistics.
5.3 Other packages from author.
5.4 Pseudocode for ranking algorithm.
5.5 Module modal.
6.1 System Level Flow Diagram of PyQuery.
6.2 PyQuery homepage.
6.3 PyQuery package level search template.
6.4 PyQuery module level search template.
ABSTRACT
Python Package Index (PyPI) is a repository that hosts all the packages ever developed for the
Python community. It hosts thousands of packages from different developers and, for the Python
community, is the primary source for downloading and installing packages. It also provides
a simple web interface to search for these packages. A direct search on PyPI returns hundreds
of packages that are not intuitively ordered, thus making it harder to find the right package.
Developers consequently resort to mature search engines like Google, Bing or Yahoo which redirect
them to the appropriate package homepage at PyPI. Hence, the first task of this thesis is to improve
search results for Python packages.
Secondly, this thesis also attempts to develop a new search engine that allows Python developers
to perform a code search targeting Python modules. Currently, existing search engines classify
programming languages such that a developer must select a programming language from a list. As
a result every time a developer performs a search operation, he or she has to choose Python out
of a plethora of programming languages. This thesis seeks to offer a more reliable and dedicated
search engine that caters specifically to the Python community and ensures a more efficient way to
search for Python packages and modules.
CHAPTER 1
INTRODUCTION
Python is a high-level programming language built around simplicity and efficiency. It emphasizes
readable code and can perform more work in fewer lines of code. In order to streamline code
and speed up development, many programmers use application packages, which reduce the need to
copy definitions into each program. These packages consist of application software written in the
form of different modules, which contain the actual code. Python's main power lies within these
packages and the wide range of functionality that they bring to the software development field.
Providing ways and means to deliver information about these reusable components is of the utmost
importance.
PyPI, a software repository for Python packages, offers a search feature to look for available
packages meeting the user's needs. It implements a trivial ranking algorithm to detect packages
matching a user's keyword, resulting in a poorly sorted, huge list of closely and similarly scored
packages. From this immense list of results, it is hard to efficiently find a package that meets
the user's needs in a reasonable amount of time. Due to the lack of an efficient native search
engine for the Python community, developers often rely on mature, multipurpose search engines like Google,
Yahoo and Bing. In order to express his or her interest in Python packages, a developer taking this
route has to articulate the query carefully and, on top of that, provide additional input. A dedicated
search engine for the Python community would bypass the need to specify one's interest in Python.
One may argue that such a search engine wouldn't alter the experience of developers who are searching
for Python packages. However, considering that on average a developer searches for packages, code
and related information in five search sessions with 12 total queries each workday [1],
a search engine that saves this time and effort is desirable.
An additional software development practice that saves time is code reuse
[2]. Many factors influence the practice of code reuse [3] [4]. One such factor is the
availability of the right tools to find reusable components. Searching for code is a serious challenge
for developers. Code search engines such as Krugle1, Searchcode2 and BlackDuck3 attempt to
ameliorate the hardship of searching for code by targeting a wide range of languages. Currently,
Python developers who conduct code searches have to learn how to configure these search engines
so that they display Python-specific results. Although these search engines
exemplify the ideal that one product may solve all kinds of problems, such an ideal fails to overcome
the problems faced by Python developers. Python developers would rather rely on a code search
engine designed exclusively for searching Python packages.
1.1
Objective
This thesis seeks to contribute to the Python community by developing a dedicated Python
search engine (PyQuery)4 that enables Python developers to search for Python packages and modules (code) and encourages them to take advantage of an important software development practice,
code reuse. With PyQuery we want to facilitate the search process, improve the query results, and
collect packages and modules into one user-friendly interface that provides the functionality of a
myriad of code search tools. PyQuery will also be able to synchronize with the Python Package
Index to provide users with code documentation and downloads, thereby providing all steps in the
code search process.
1.2
Approach
PyQuery is organized into three separate components: Data Collection, Data Indexing and Data
Presentation. The Data Collection component is responsible for collecting all the data required
to facilitate the search operation. This data includes source code for packages, metadata about
packages from PyPI, preprocessed metadata about modules, code quality analysis for packages and
other relevant information that helps us deliver meaningful search results. In order to provide the
most recent updates to packages at PyPI, we ensure that the data we collect and maintain is always
in sync with changes made at PyPI. For this reason, we use Bandersnatch5, a PyPI mirror
client that keeps track of changes using state files.
1. http://www.krugle.com/
2. https://searchcode.com/
3. https://code.openhub.net/
4. https://pypi.compgeom.com/
5. https://pypi.python.org/pypi/bandersnatch
The Data Indexing component stores all the data we have collected and processed in the Data
Collection component in a structured schema that facilitates faster search queries for matching packages
and modules. We used Elasticsearch (ES)6, a flexible and powerful, open source, distributed, real-time
search and analytics engine built on top of Apache Lucene, to index our data. We rely on
FileSystem River (FSRiver)7, an ES plugin, to index documents from the local file system using
SSH. In ES, we use separate indexes for module level search and package
level search. By using this approach, we can map each query to its specific type and related files.
The Data Presentation component delivers matched search results to the user in a fashion that
is both appealing and easy to follow. We have used Flask8 for server side scripting. When a user
queries for matching packages, we send a query to the index responsible for packages and retrieve
the details that allow the user to see the most significant packages, their scores, statistics and
other relevant information. We implemented a ranking algorithm that fine-tunes the ES
results by sorting them based on various metrics. Additionally, when a user queries for matching
modules, a request is sent to the ES index for modules, which contains metadata (e.g., class names,
method names), to get a list of matches alongside their line numbers and paths to the modules on the
server. For every match, a code snippet containing the matching line is rendered using Pygments9. To
reduce the time needed to process matched results, all the modules are preprocessed with Pygments and
each line number is mapped to its starting byte address in the file, so that the server can quickly
open the pygmentized file, seek to the calculated byte location, and pull the required piece of HTML
code snippet.
6. https://www.elastic.co/products/elasticsearch
7. https://github.com/dadoonet/fsriver
8. http://flask.pocoo.org/
9. http://pygments.org/
CHAPTER 2
RELATED WORK
Search engines employ metrics for scoring and ranking, but these metrics are often limited and do
not differentiate the significant packages. Additionally, these metrics do not exhibit all the qualities
that may be relevant to what a user wants out of a specific module or package.
The PyPI [5] website is the exemplar for this project. When one searches for packages,
PyPI follows a very simple search algorithm which gives a score for each package based on the
query. Certain fields such as name, summary, and keywords are matched against the query and a
binary score for each field is computed (basically a “yes; it matched” or “no; it didn’t”). A weight
is given for each field, and the composite scores from each field are added to create a total score
for each package. Packages are first sorted by score and then in lexicographical order of package
names. We found this information at stackoverflow1 and followed the steps given to confirm the
working of the PyPI searching algorithm.
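To make the mechanism concrete, here is a minimal sketch of such a binary-match-and-weight scheme; the field names and weight values are illustrative assumptions, not PyPI's actual configuration.

# Illustrative sketch of a PyPI-style scorer; FIELD_WEIGHTS values are
# assumed for demonstration, not PyPI's real weights.
FIELD_WEIGHTS = {"name": 4, "keywords": 3, "summary": 2}

def score(query, package):
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        # binary match: the field either contributes its full weight or nothing
        if query.lower() in package.get(field, "").lower():
            total += weight
    return total

def rank(query, packages):
    # sort by score (descending), then lexicographically by name
    return sorted(packages, key=lambda p: (-score(query, p), p["name"]))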
The above method employed by PyPI works, but it doesn't distinguish the packages very well.
For example, searching for “flask” yields 1750 results with a top score of 9 given to 162 packages
(fortunately, the Flask web framework, which should be at the top, is listed 4th
due to the alphabetical tie-breaking). This also makes it very easy for a package developer to
influence the outcome of popular queries. An algorithm that resists the influence of a
package owner would be a better fit for reliable package searches.
PyPI Ranking [6] is another website, created by Taichino, that is similar to PyPI and PyQuery
in that it searches only Python packages and no other languages. It has a search function that takes
in a user's search query and finds relevant packages. It also syncs with PyPI so that the user
in a user’s search query and finds relevant packages. It also syncs with PyPI so that the user
can access the information contained on PyPI such as documentation and downloads. The main
difference, however, is that PyPI Ranking ranks packages based only on the number of downloads,
so packages with more downloads will emerge higher up on the list. This means that packages
get more value based on their popularity, which is a valuable metric, but not the only valuable
1. http://stackoverflow.com/questions/28685680/what-does-weight-on-search-results-in-pypi-helpin-choosing-a-package
metric. Furthermore, the website only allows a package level search, whereas PyQuery provides
both package level and module level search functions, giving the user more resources.
Additionally, the website uses Django for its web development, whereas
PyQuery uses the microframework Flask.
There are multiple code search engines that allow users to look for existing code related to
their search query, including websites such as Krugle2, Open Hub Code
Search - BlackDuck3, and SearchCode4. These websites let users enter a search query and
then list sample lines of code matching it. These websites
are limited, however, because they can only search code hosted at GitHub5, Sourceforge6 and other open
source repositories. They are not contained within one context in which a user might want to
find a specific package or module. Additionally, these websites search purely on term
occurrence, identifying the user's search term within the lines of code and returning the code
samples with numerous hits. The results a user receives may not address what
they want, but rather just contain the term itself. Consequently, the results are not scored, due to
the lack of relevant information to incorporate as metrics. PyQuery accesses data directly from
PyPI, preprocesses the data to extract useful information from code, indexes the data within itself,
searches the data, and reorders the results with a ranking function. PyQuery is also constructed within
the Python community, so Python packages and modules are ranked only against other Python
packages and modules. These results are more valuable due to the metrics they are based on and
the nature of the searching algorithm.
In the past, people have attempted code search in languages like Java7 based on semantics
that are test driven [7], requiring users to specify what they are searching for as precisely as
possible. This means that they need to provide both the syntax and the semantics of their target.
Furthermore, they must pass a set of test cases that include sample input and expected output to filter the
potential set of matches. This is a great technique to search for code that can be reused; however, it has its
limitations. The tool requires the kind of detail about the input that the user will not know in
the first place. It is more helpful for testing a reusable coding entity whose path from the
2. http://www.krugle.com/
3. https://code.openhub.net/
4. https://searchcode.com/
5. https://github.com/
6. http://sourceforge.net/
7. http://www.oracle.com/technetwork/java/index.html
package root (e.g., str.capitalize()) is known to the user precisely. If a user is looking for code that
capitalizes the first character of a string, he may guess the function name to be capitalize() but may not
precisely know that it can be found in str with the signature str.capitalize(). A user who already
knows this information may directly look inside the usage documents to see if it meets his
or her requirements (though he or she may have to execute test cases on their own).
Nullege is a dedicated search engine for Python source code [8]. As a keyword, it requires
a UNIX-path-like structure (with “/” replaced by “.”) as used in Python import statements, always
starting at the level of the package root folder. Sample queries for Nullege include
“flask.app.Environment”, “requests.adapters.BaseAdapter” and “scipy.add”. Results from
a search on Nullege point to the source code where the programming entity is imported.
This is a useful tool for users who are familiar with the folder structure of a package and
are curious to explore its source code or to learn which packages import it. A user
can't directly pass generic keywords that convey the purpose of the programming entity he or she is
interested in. For users who want to learn whether a reusable component exists for a specific task
at hand and are not aware of the precise location to look at, Nullege is not the right tool. Because
of the limitations imposed on the input and the type of results returned, Nullege can be classified as an
exploration tool for the source code tree of Python packages rather than a search engine for source code.
PyQuery allows users to perform a generic keyword search without input limitations like those
of Nullege. PyQuery results are usually code snippets that point to definitions of programming
entities rather than import statements.
We have used the Abstract Syntax Tree (AST) to collect various programming entity names and
their line numbers in modules for code search. Many research topics that analyze software code
use ASTs. Applications of ASTs include semantics-based code search [7], understanding
source code evolution using abstract syntax tree matching [9] and employing source code
information to improve question-answering on Stack Overflow [10]. These implementations construct
an AST for the code under consideration and extract the needed information by walking through the tree
or directly visiting the required node. For this purpose, we have used the ast module [11] in Python.
Chapter 3 elaborates on how we extract metadata about modules for code search.
CHAPTER 3
DATA COLLECTION
For any search engine to work, it requires data to perform search operations. For the problem
we plan to solve, we have to address the question “What kind of data are we interested in?”. We
are interested in data related to Python packages that can help us return meaningful results for a
user query. We intend to provide two flavors of the search engine: Package Level Search and
Module Level Search. Let us examine the tools and configurations that help us collect the data
required to achieve this goal.
3.1
Package Level Search
A package is a collection of modules meant to solve problems of some type. For
example, the “requests” package handles HTTP capabilities. According to its homepage1,
it has various features, including International Domains and URLs, Keep-Alive and Connection
Pooling, Sessions with Cookie Persistence, Browser-style SSL Verification, etc. A user interested in
these features would like to use this library to solve his or her problem. A developer may produce
a library and assign it a name that may or may not relate to the purpose
of the library. A user learns whether the library helps solve his problem not by
looking at its name alone but at the description, sample usage and other useful metadata about the
package mentioned on its homepage. When a user has to pick between multiple packages
that try to solve the same problem, criteria like the popularity of the author, number of downloads,
frequency of releases, code quality and efficiency start to factor in. A search engine that returns
Python packages as matches to a user query requires similar information.
3.1.1
Metadata - Packages
We have discussed how a developer's description of a package on its homepage, frequency of
releases, code quality, popularity of the author and number of downloads help a user decide whether
1. http://docs.python-requests.org/en/latest/
a given library solves his or her problem. PyQuery needs this information to search for matching
packages for the user's query and, if multiple packages qualify, to prioritize one over the other. One
direct way to get the description of a package is to crawl its homepage at PyPI2. Though this sounds
straightforward and easy, gathering URL information for the latest stable release of each
package and maintaining this information could be tricky, and searching for the required information
in crawled data could be time consuming.
We found an elegant and much simpler way to gather metadata of a package. PyPI allows users
to access metadata about a package via an HTTP request3. This returns a JSON
file with keys such as description, author, package_url, downloads.last_month, downloads.last_week,
downloads.last_day, releases.x.x.x, etc. For example, one can query the PyPI website for metadata
about the “requests” package through a URL4. Refer to Figure 3.1 for a sample response from PyPI.
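As a concrete illustration, the following minimal sketch pulls this JSON with the requests library; the helper name is ours, not part of PyQuery.

# Minimal sketch: fetch a package's metadata from PyPI's JSON endpoint.
import requests

def fetch_metadata(package_name):
    url = "http://pypi.python.org/pypi/%s/json" % package_name
    return requests.get(url).json()

meta = fetch_metadata("requests")
print(meta["info"]["author"])                   # "Kenneth Reitz"
print(meta["info"]["downloads"]["last_month"])  # e.g. 4002673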
3.1.2
Code Quality
PEP 0008 – Style Guide for Python Code5 describes a set of semantic rules and guidelines
that Python developers should incorporate into their code. These standards are highly encouraged
by the Python community; the standard libraries shipped with the installation are written using
these conventions. One main reason to emphasize a standardized style guide is to increase code
readability. The Python code base is huge, and it is important to maintain consistency
across it. The conventions set in the Python style guide make Python code easy to
follow as you read.
The code quality of a package can be measured in multiple ways. First, we can check whether the package
under consideration follows the style guide for Python code. The Python community has tools to
check a package's compliance with the style guide. pep86 is a simple Python module that uses
only standard libraries and validates any Python code against the PEP 8 style guide. Pylint7 is
another such tool that checks for line length, variable names, unused imports, duplicate code and
other coding standards against PEP 8.
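For instance, a minimal sketch of checking a file with pep8's Python API (the target filename is hypothetical):

# Minimal sketch: validate one file against PEP 8 using the pep8 module.
import pep8

style = pep8.StyleGuide(quiet=True)
report = style.check_files(["sessions.py"])  # hypothetical target file
print(report.total_errors)  # total number of style violations found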
2. https://pypi.python.org/pypi
3. http://pypi.python.org/pypi/<package_name>/json
4. http://pypi.python.org/pypi/requests/json
5. https://www.python.org/dev/peps/pep-0008/
6. https://pypi.python.org/pypi/pep8
7. http://www.pylint.org/
{
    "info": {
        ...
        "package_url": "http://pypi.python.org/pypi/requests",
        "author": "Kenneth Reitz",
        "author_email": "[email protected]",
        "description": "Requests: HTTP for Humans ...",
        ...
        "release_url": "http://pypi.python.org/pypi/requests/2.7.0",
        "downloads": {
            "last_month": 4002673,
            "last_week": 1307529,
            "last_day": 198964
        },
        ...
    },
    "releases": {
        "1.0.4": [
            {
                "has_sig": false,
                "upload_time": "2012-12-23T07:45:10",
                "comment_text": "",
                "python_version": "source",
                "url": "https://pypi.python.org/packages/source/r/requests/requests-1.0.4.tar.gz",
                "md5_digest": "0b7448f9e1a077a7218720575003a1b6",
                "downloads": 111768,
                "filename": "requests-1.0.4.tar.gz",
                "packagetype": "sdist",
                "size": 336280
            }
        ],
        ...
    }
}
Figure 3.1: Metadata from PyPI for package requests.
pep8 and Pylint are great tools for checking code quality, but we decided
to use Prospector8, which brings together the functionality of both pep8 and Pylint. It also adds the
functionality of the code complexity analysis tool McCabe9. When a package is processed
with Prospector, it gives a count of errors, warnings and messages along with their detailed descriptions.
This information gives an indication of code quality.
There is another set of information we can use for analyzing code quality. Developers care
about how well the code is commented. The ratio of the number of comment lines to the total
number of lines, of code lines to the total number of lines, and of warnings to the
number of lines offers some metrics for code quality analysis. CLOC10 helps us acquire this
information. CLOC stands for Count Lines of Code; when we run it on the Python package
under consideration, it returns the total number of files, number of comment lines, number of blank lines
and number of code lines. We collect this information to check for code quality.
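A sketch of deriving these ratios from CLOC- and Prospector-style counts might look like this (function and argument names are ours, not the thesis code):

# Sketch: turn raw counts into the quality ratios used as metrics.
def quality_ratios(comment_lines, code_lines, blank_lines, warnings):
    total_lines = comment_lines + code_lines + blank_lines
    return {
        "comments_to_total": comment_lines / float(total_lines),
        "code_to_total": code_lines / float(total_lines),
        "warnings_to_total": warnings / float(total_lines),
    }

print(quality_ratios(comment_lines=120, code_lines=300, blank_lines=80, warnings=25))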
3.2
Module Level Search
A module is a Python file with extension “.py”. It is a collection of multiple programing
units such as classes, methods and variables. Some developers are interested in searching for these
programming entities in a module, so we wanted to build a search engine for them. There are
various steps involved in achieving this goal.
3.2.1
Mirror Python Packages
In order to allow users to perform module level search, i.e., to search for classes,
methods and variables, we need to extract this information from the modules and the packages that hold
them. We are interested in the source code of all available Python packages. If there is a new release
of any package we already have, we want to update the information we hold on that package. All
of these operations would be complex and cumbersome if we automated the process
of downloading source code from each package's homepage (assuming we somehow managed to
collect source code download URLs for all packages). We found a better alternative in a
practice followed by software development organizations. Some of these organizations
8. https://github.com/landscapeio/prospector
9. https://pypi.python.org/pypi/mccabe
10. http://cloc.sourceforge.net/
would rather their developers not hit the world wide web to download the software packages they need
for development. Instead, they maintain a local mirror of the PyPI repository from which developers
can download the necessary packages without connecting to the Internet.
Currently, PyPI is home to 50,000+ packages and would be a single point of failure if it went
down. In order to avoid such a disaster, PyPI came up with PEP 38111, a mirroring infrastructure
that can clone the entire PyPI repository onto a desired machine. People started making public and
private repositories using this infrastructure. For our purposes, we use Bandersnatch12, a client-side
implementation of PEP 381, to sync Python packages. When Bandersnatch is executed for
the first time, it mirrors the entire PyPI, i.e., downloads all the Python packages. It also
maintains state files that capture the current state of the repository, which are later used to
sync with PyPI and pick up any updates made to the packages. A recurring cron job executing the command
“bandersnatch mirror” keeps the local repository always up to date.
3.2.2
Metadata - Modules
We have previously discussed that developers show interest in doing code search for programming entities. We mirrored
the entire PyPI repository onto our servers using Bandersnatch. In order
to enable code search, we have to extract useful information from the modules of each package, i.e., get a
list of programming entities for each module. There are many programming entities in a Python
module, but we are mainly interested in classes, functions under classes, variables under classes,
global functions, recurring inner functions inside global functions, variables inside global functions
and global variables. We maintain each of them under a separate key so that we can give more weight
to certain entities than others.
To collect the required information, we iterate through all packages; within each package we iterate
through all modules; for each module, we construct an Abstract Syntax Tree using the ast13 module
from Python and perform a walk (visit all) operation on this tree. As the walk operation visits each
programming entity, it invokes function calls inside ast.NodeVisitor such as visit_Name,
visit_FunctionDef, visit_ClassDef and so on, as per the current element. We override the ast.NodeVisitor
class and the functions inside it and perform the visit-all operation on top of it so that we have control
over the operations performed inside them. For example, during the walk, if a class is being visited, a
11. https://www.python.org/dev/peps/pep-0381/
12. https://pypi.python.org/pypi/bandersnatch
13. https://docs.python.org/2/library/ast.html
# Sample code for collecting metadata of a module
import ast

class PyPINodeVisitor(ast.NodeVisitor):
    def visit_Name(self, node):
        # collect variable name (node.id) and line number (node.lineno)
        self.generic_visit(node)

    def visit_FunctionDef(self, node):
        # collect function name (node.name) and line number (node.lineno)
        self.generic_visit(node)

    def visit_ClassDef(self, node):
        # collect class name (node.name) and line number (node.lineno)
        self.generic_visit(node)

    def visit_all(self, node):
        # call the superclass visit machinery to walk the whole tree
        return self.visit(node)

if __name__ == "__main__":
    for modules in packages:            # all mirrored packages
        for module in modules:          # each .py file's source text
            tree = ast.parse(module)
            JSONfile = PyPINodeVisitor().visit_all(tree)
Figure 3.2: Pseudocode for collecting metadata of a module.
function call to visit_ClassDef is invoked. Since we have overridden this function, we are in control
of the information passed to it and decide what to do with it. We can collect the information that is of
interest to us, such as the names of the various classes and the line numbers at which they occur. Figure
3.2 is the pseudocode for using ast to generate the required metadata of a module. This way, we
can collect all the metadata for a module and save it in JSON format, making it available for
Module Level Search. Figure 3.3 is an example of one such JSON file generated by this
process. Each identifier is concatenated with its line number and padded with additional underscores to
a minimum length of 18. The reason behind this format is discussed in Chapter 4. As part of data
collection, it is important that the information collected is stored in an agreed format
that enables better indexing and searching techniques.
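The padding step itself is mechanical; here is a sketch of one way to produce such tokens (the helper name is ours):

# Sketch: concatenate identifier and line number, pad with underscores
# to the minimum length of 18 described above.
def pad_token(name, lineno, min_len=18):
    return ("%s_%d" % (name, lineno)).ljust(min_len, "_")

print(pad_token("salt", 278))                          # 'salt_278__________'
print(pad_token("SecureCookieSessionInterface", 272))  # already >= 18, unchanged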
3.2.3
Code Quality
Similar to the method we have applied to collect code quality for packages using Prospector and
CLOC; we have also collected this information at the module level. We couldn’t use Prospector to
process a single module like we did for package, so we used Pylint instead of Prospector. CLOC
helped towards obtaining the number of comment lines, number of blank lines and number of code
lines at the module level.
12
{
    "class": "SessionMixin_28___ TaggedJSONSerializer_55
        SecureCookieSession_109 NullSession_119
        SessionInterface_134 SecureCookieSessionInterface_272",
    "class_function": "get_signing_serializer_290 open_session_301__
        save_session_315__",
    "class_variable": "salt_278__________ digest_method_280_
        key_derivation_283 serializer_287____ session_class_288_
        self_290__________ app_290___________ signer_kwargs_293_
        self_301__________ app_301___________ request_301_______
        val_305___________ max_age_308_______ self_315__________
        app_315___________ session_315_______ response_315______
        domain_316________ path_317__________ httponly_323______
        secure_324________ expires_325_______ val_326___________
        data_310__________",
    "function": "total_seconds_24__",
    "function_function": "",
    "function_var": "",
    "module": "sessions",
    "module_path": "Flask-0.10.1/flask/sessions.py",
    "variable": "session_json_serializer_106"
}
Figure 3.3: Metadata for a module in Flask package.
CHAPTER 4
DATA INDEXING AND SEARCHING
We used Elasticsearch (ES)1, a flexible and powerful, open source, distributed, real-time search and
analytics engine built on top of Apache Lucene, to index our data and query the indexed data for
both package level search and module level search. FileSystem River (FSRiver)2, an ES plugin, is
used to index documents from a local file system or a remote file system (using SSH).
4.1
Data Indexing
4.1.1
Package Level Search
Extracted data for each Python package is indexed in ES using FSRiver, where the data for each
Python package is considered a document. Although all fields in a document are indexed in the ES
server, only the following fields are analyzed before indexing: name, summary, keywords and description
(refer to Figure 4.1 for the ES mapping), using the ES Snowball analyzer3. The Snowball analyzer
generates tokens using the standard tokenizer, removes English stop words and applies the standard filter,
lowercase filter and snowball filter. For example, a summary phrase like “HTTP for Humans” would be
indexed as the tokens http and human, with the stop word “for” dropped and “Humans” lowercased
and stemmed. The other fields are not analyzed before indexing, either because
they are numbers (e.g., info.downloads.last_month) or because they are of no interest with respect to
the search query.
Figure 4.2 depicts the river definition, which indexes the package level data (in .json
format) located on the server, looks for updates every 12 hours and reindexes the data if there is any
update.
4.1.2
Module Level Search
Extracted data for each module in a Python package is indexed in ES, where the data for each
module in a Python package is considered a document. All fields except module_path in a
document are analyzed using a custom analyzer (refer to Figure 4.3 for the definition of the
custom analyzer) before they are indexed. The custom analyzer generates tokens using a custom
1. https://www.elastic.co/products/elasticsearch
2. https://github.com/dadoonet/fsriver
3. https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html
PUT packagedata/packageindex/_mapping
{
    "packageindex": {
        "properties": {
            "info": {
                "properties": {
                    "name": {
                        "type": "string",
                        "index": "analyzed",
                        "analyzer": "snowball"
                    },
                    "summary": {
                        "type": "string",
                        "index": "analyzed",
                        "analyzer": "snowball"
                    },
                    "keywords": {
                        "type": "string",
                        "index": "analyzed",
                        "analyzer": "snowball"
                    },
                    "description": {
                        "type": "string",
                        "index": "analyzed",
                        "analyzer": "snowball"
                    }
                }
            }
        }
    }
}
Figure 4.1: Package level data index mapping.
tokenizer (splitting by underscore, camel case and numbers while preserving the original
token) and applies the ES snowball filter4. The module_path field is not analyzed, as we will not look for
matches in this field for a search query. We use the fast vector highlighter5 (by setting term_vector
to with_positions_offsets in the mapping) instead of the plain highlighter in module level search.
4. https://www.elastic.co/guide/en/elasticsearch/reference/1.4/analysis-snowball-tokenfilter.html
5. https://www.elastic.co/guide/en/elasticsearch/reference/1.3/search-request-highlighting.html
PUT _river/packageindexriver/_meta
{
    "type": "fs",
    "fs": {
        "url": "/server/package/data/directory/path",
        "update_rate": "12h",
        "json_support": true
    },
    "index": {
        "index": "packagedata",
        "type": "packageindex"
    }
}
Figure 4.2: River definition.
The fast vector highlighter lets you define fragment size and number of matching fragments to be
returned. Figure 4.4 depicts the mapping defined for module level data indexing.
4.2
Data Searching
We define our search query according to the ES Query DSL to look for matches in the index for user
queries. ES uses the Boolean model to find matching documents and a formula called the practical
scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse
document frequency and the vector space model, but adds modern features like a coordination factor,
field length normalization, and term or query clause boosting6.
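For reference, the practical scoring function as documented by ES at the time has roughly the following shape (reproduced here from the ES documentation for context, not from the thesis):

score(q, d) = queryNorm(q) × coord(q, d) × Σ_{t in q} [ tf(t, d) × idf(t)² × boost(t) × norm(t, d) ]

where tf(t, d) is the frequency of term t in document d, idf(t) is its inverse document frequency, boost(t) is the query-time boost (such as the caret boosts used in our queries), and norm(t, d) is the field length normalization.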
4.2.1
Package Level Search
Figure 4.5 depicts the query used for package level search. ES looks for matches for the user
search query in the following fields: name, author, summary, description and keywords. Based
on the matches, it ranks the results and returns the name, author, summary, description, version,
keywords and number of downloads in the last month for each of the top n ranked Python packages,
where n is the number of matching packages requested. The summary and description of a matched
package are returned in the highlight section, where matches are highlighted using the <em> tag,
which we utilize in the user interface while displaying the results.
6. https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
PUT moduledata
{
    "settings": {
        "analysis": {
            "filter": {
                "code1": {
                    "type": "pattern_capture",
                    "preserve_original": 1,
                    "patterns": [
                        "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
                        "(\\d+)"
                    ]
                }
            },
            "analyzer": {
                "code": {
                    "tokenizer": "pattern",
                    "filter": ["code1", "lowercase", "snowball"]
                }
            }
        }
    }
}
Figure 4.3: Custom Analyzer with custom pattern filter.
4.2.2
Module Level Search
Figure 4.6 depicts the query used for module level search. ES detects matches to the user
search query in the following fields of a module document: class, class_function, class_variable,
function, function_function, function_var and variable. Different weights are assigned to matches in
different fields based on their importance; for example, a match in the class field weighs
more than a match in the function field of a module. Weights are assigned using a caret (^)
sign followed by a number. Based on the matches, ES ranks the results and returns the path to
the module (module_path) where the match occurred.
PUT moduledata/moduleindex/_mapping
{
    "pypimtype": {
        "properties": {
            "module": {
                "type": "string",
                "store": "yes",
                "analyzer": "code"
            },
            "module_path": {
                "type": "string",
                "index": "not_analyzed"
            },
            "class": {
                "type": "string",
                "store": "yes",
                "analyzer": "code",
                "term_vector": "with_positions_offsets"
            },
            "class_function": {
                "type": "string",
                "store": "yes",
                "analyzer": "code",
                "term_vector": "with_positions_offsets"
            },
            "class_variable": {
                "type": "string",
                "store": "yes",
                "analyzer": "code",
                "term_vector": "with_positions_offsets"
            },
            "function": {
                "type": "string",
                "store": "yes",
                "analyzer": "code",
                "term_vector": "with_positions_offsets"
            },
            ...
        }
    }
}
Figure 4.4: Module level data indexing mapping.
GET packagedata/packageindex/_search
{
    "query": {
        "multi_match": {
            "query": "search query",
            "operator": "or",
            "fields": ["info.name^30", "info.author", "info.summary",
                       "info.version", "info.keywords"]
        }
    },
    "fields": [
        "info.name", "info.author", "info.summary", "info.version",
        "info.keywords", "info.downloads.last_month"
    ],
    "highlight": {
        "fields": {
            "summary": {},
            "description": {}
        }
    }
}
Figure 4.5: Package level search query.
Using this information, we retrieve the module's .py file and extract the relevant code segment for each
matching module. We used the Fast Vector Highlighter (FVH) for module level search, which returns
the top n fragments from each field (class, class_function, class_variable, function, function_function,
function_var and variable) in a module document. The fragment size has been strategically defined
to be 18, which is the minimum fragment size ES allows. While forming the module level search
documents, we appended the line number to each identifier where it appears so that we can use the
module_path and line number information to retrieve the relevant code segment. If an identifier
appended with its line number is less than 18 characters long, we pad it with underscores to make it
18. This trick is useful because the ES FVH creates fragments based on fragment size and word
boundaries. This way, ES creates a fragment for each identifier in a field, looks for matches to the
search query in each field and returns the top n fragments, where n is the number of fragments set
in the query.
GET moduledata/moduleindex/_search
{
    "query": {
        "multi_match": {
            "query": "user query",
            "fields": [
                "class^5", "class_function", "class_variable",
                "function^4", "function_function", "function_var",
                "variable^3"
            ]
        }
    },
    "fields": [
        "module_path"
    ],
    "_source": false,
    "highlight": {
        "order": "score",
        "require_field_match": true,
        "fields": {
            "class": {
                "number_of_fragments": 5,
                "fragment_size": 18
            },
            "class_function": {
                "number_of_fragments": 5,
                "fragment_size": 18
            },
            "class_variable": {
                "number_of_fragments": 5,
                "fragment_size": 18
            },
            "function": {
                "number_of_fragments": 5,
                "fragment_size": 18
            },
            ...
            "variable": {
                "number_of_fragments": 5,
                "fragment_size": 18
            }
        }
    }
}
Figure 4.6: Module level search query.
CHAPTER 5
DATA PRESENTATION
Once the data is indexed, it needs to be presented quickly and cleanly. We chose to
create a search engine style of interface. Our goal for package level search was to provide a ranked
list of relevant packages and their details for any given query. For module level search, we wanted
to provide actual source code snippets related to the query. In order to display some of the source
code to the user, preprocessing was necessary.
5.1
Server Setup
Our server has a simple stack setup. We use Nginx1 to handle requests, Flask2/Python3 to
process and serve our data, and uWSGI4 as the interface between Nginx and Flask. Elasticsearch
(ES)5 holds our data for package and module level search. To make sure we are using
the latest packages from PyPI, we use sqlite databases to track the modified times of each package.
Much of the rendering and manipulation of the browser interface is done using JavaScript6; our
JavaScript library of choice is jQuery7.
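To make the stack concrete, here is a minimal sketch of the kind of Flask endpoint such a setup serves; the route, parameters and helper are illustrative assumptions, not the thesis code.

# Illustrative Flask endpoint: run_es_query is a hypothetical helper that
# would issue the ES queries shown in Chapter 4.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search")
def search():
    query = request.args.get("q", "")
    level = request.args.get("level", "package")  # "package" or "module"
    return jsonify(run_es_query(query, level))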
5.2
Browser Interface
The interface features a home page with a simple text box for queries and a choice of either
package or module level search. The results page displays the query information at the top and
the results themselves below. The user can modify his or her query or switch the search type
between package level and module level search. Chapter 6 includes some
sample screenshots of the browser interface.
1. https://www.nginx.com/resources/wiki/
2. http://flask.pocoo.org/
3. https://www.python.org/
4. https://uwsgi-docs.readthedocs.org/en/latest/
5. https://www.elastic.co/products/elasticsearch
6. https://developer.mozilla.org/en-US/docs/Web/JavaScript
7. https://jqueryui.com/
Figure 5.1: Package modal.
5.3
Package Level Search
When a user sends a query to the server for package level search, the query is processed and
a ranked list of packages is sent back. Each package is depicted in the browser as a tile (refer
to Figure 6.3). A tile provides minimal information about the package: its name,
the author, a brief description, the number of downloads and the score assigned by the ranking
algorithm. The user has the option to click the tile to view more information about the package
and visit PyPI and other sites related to the package. When the user clicks on the tile, a modal
opens that contains the detailed description, version, source code homepage, PyPI homepage and score
from the ranking algorithm (refer to Figure 5.1), statistics of the package as a bar graph, pie chart and
numbers (refer to Figure 5.2), and other packages from the author (refer to Figure 5.3). This process
relies on the ranking of packages.
Figure 5.2: Package statistics.
5.3.1
Ranking of Packages
One of the most important aspects that distinguishes this search engine from others is its use
of many types of data for ranking the packages. In Chapters 3 and 4 we discussed
how all the preprocessed information about packages and modules, including basic details from
PyPI, is stored in our ES server. We felt that the ES relevance algorithm was not thorough enough
to return meaningful results on its own. So we use a few other metrics, namely Bing search results8,
number of downloads, the ratio of warnings to the total number of lines, the ratio of comments
to the total number of lines and the ratio of code lines to the total number of lines (gathered by
Prospector and CLOC, also visible in the package modal referenced in Figure 5.2). All of these metrics
are passed as columns to the ranking algorithm; together, the columns form a matrix.
After the ranking algorithm is executed, it returns a list of packages sorted in descending order
of the newly generated scores and a dictionary mapping package names to their scores. Note that
8. http://datamarket.azure.com/dataset/bing/search
Figure 5.3: Other packages from author.
the ES column and the Bing column are primary columns, while the others are secondary, derived columns.
Tuning these column weights can be tricky. One way to fine-tune this algorithm is to try different
combinations of weights and learn which works better. We can also fine-tune this algorithm by
adding more primary columns or secondary derived columns. For example, at the time of writing
this thesis, we were experimenting with PageRank as one of the primary columns. We sought to
calculate PageRank based on the import statements in each module. Just as a web page “A” links
to some other page “B”, in Python modules we have import statements where a
module “C” imports module “D”. This can be considered a vote cast by “C” for module “D”,
and this information can be used to generate PageRank for modules and packages.
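A hypothetical sketch of this import-graph PageRank idea, using the networkx library (not part of the thesis implementation; module names are arbitrary):

# Each import statement becomes a directed edge: a "vote" from the
# importing module to the imported one.
import networkx as nx

g = nx.DiGraph()
g.add_edge("C", "D")  # module C imports module D
g.add_edge("A", "D")
g.add_edge("D", "B")
scores = nx.pagerank(g, alpha=0.85)  # standard damping factor
print(sorted(scores, key=scores.get, reverse=True))  # most-voted modules first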
Ranking Algorithm.
1. Request ES server for top 20 results for user query.
2. Request Bing for top 20 results for user query.
3. Generate other relevant columns or metrics.
4. Pass the list of all columns generated in steps 1, 2 and 3 to the rerank function, along with
weights for each column.
5. Inside the rerank function:
(a) Find the max length of the columns, i.e., the number of rows in the matrix.
(b) Construct a results dictionary over the set union of all cells in the matrix, with the
package name as key and the value (score) initialized to 0.
(c) For each row in the matrix:
i. For each cell in the row:
A. Find the package score for its position using the formula
(maxlen − row.rownumber) × weightvector(cell.columnnumber),
where “maxlen” is the total number of rows in the matrix, “row.rownumber” is
the position of the current row in the matrix and “weightvector(cell.columnnumber)”
gives the weight assigned to the particular column the cell belongs to.
B. Add this score to the existing score of that package in the dictionary created in
step 5(b).
At the end of this step, each package in the results dictionary has a cumulative score for
its standing across the different columns.
(d) From the results dictionary, generate a list sorted in descending order of score. This
gives the results in reranked order.
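As an illustration (the numbers here are chosen for clarity, not taken from the thesis data): if the matrix has 30 rows and the ES column carries weight 0.4, the package sitting in the first ES row (rownumber 0) collects (30 − 0) × 0.4 = 12 points from that column alone, while the package in the tenth ES row (rownumber 9) collects (30 − 9) × 0.4 = 8.4.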
Note that there can be duplicates between the top 20 results of the ES server and the top 20 results of
Bing, so the total number of rows in the matrix passed to the ranking function is not always 40. Figure
5.4 is sample pseudocode for the ranking algorithm. Table 5.1 is the sample matrix formed for the
keyword “music”. In this table, looking at the cells highlighted with a yellow background, you can
notice the duplicate name “vis-framework” between the primary columns Elasticsearch and Bing. This only
means that both primary columns agree that this is a relevant result for the given keyword.
Looking at the cells highlighted in red, you can notice duplicates within the same primary column,
Bing. This scenario occurs because multiple versions of the same package become popular. From
the table we can also notice that these duplicates are carried forward to the secondary columns, thus
influencing the ranks. Whether to allow duplicates in the primary columns and whether to let them
carry forward to the secondary columns are decisions yet to be made. For now, this practice in the ranking
algorithm has looked promising, with positive effects. As we investigate more use cases, if it
turns out duplicates have a negative influence, we can always eliminate them without disturbing
the nature of the ranking function. For keyword “music”, Table 5.2 shows the matching packages and
their scores calculated by the ranking function, ordered in descending order.
5.4
Module Level Search
Refer to Figure 6.4 to view the template used for module level search. As with package level
search, the user sends a query to the server; this time, however, the server returns a list of
code excerpts. One of our concerns was to keep the wait time of queries low, so a preprocessing
step was added to retrieve the code faster.
5.4.1
Preprocessing
To reduce the wait time of searches, the source code needs to be adapted for the browser. As
previously mentioned, Bandersnatch is used to create a local mirror of the PyPI repository on our
server; however, only compressed packages are maintained by Bandersnatch. A decompression
step is required to examine each Python module in plain text.
Our aim was to show nicely formatted, stylized lines of code. Initially, we were going to send
about twenty lines from the source code to the user and render the code on the user's side using a
third-party JavaScript library such as SyntaxHighlighter9. This worked well except for multi-line
constructs such as doc strings: since lines of a doc string can be missing during
client-side rendering, the renderer has no way of knowing how to stylize it. Instead,
we fixed this by rendering the code snippets before sending them to the client (server-side rendering).
For this, we used Pygments10, a Python library for creating stylized code in Python and other
languages in numerous formats such as HTML. This inevitably increases the amount of data sent
from the server, but it ensures that the code is correctly stylized.
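A minimal sketch of this server-side rendering step with Pygments (the snippet content is arbitrary):

# Render a Python snippet to stylized HTML on the server.
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter

snippet = "def save_session(self, app, session, response):\n    pass\n"
html = highlight(snippet, PythonLexer(), HtmlFormatter(linenos=True))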
Lines of Code (LOC) mapping is another trick used to expedite the code display. It creates a "mapping" file for each module. The mapping file stores the exact byte at which each line of the module starts. When it is time to grab a snippet of code, the server can open the file and immediately seek to the correct spot rather than linearly scan through the preceding lines. This can skip a thousand or more lines of code, cutting down on processing time at the cost of only one more step in the preprocessing.
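The sketch below illustrates the idea; the on-disk format of the real mapping file is an assumption here:

    def build_line_offsets(module_path):
        # record the byte offset at which every line of the module starts
        offsets = []
        position = 0
        with open(module_path, 'rb') as source:
            for line in source:
                offsets.append(position)
                position += len(line)
        return offsets

    def read_snippet(module_path, offsets, start_line, num_lines=20):
        # jump straight to start_line (1-indexed) instead of scanning the file
        with open(module_path, 'rb') as source:
            source.seek(offsets[start_line - 1])
            return [source.readline().decode('utf-8', 'replace')
                    for _ in range(num_lines)]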
9 http://alexgorbatchev.com/SyntaxHighlighter/
10 http://pygments.org/
def generateTop20ESResults():
    # query the ES server and return its top 20 matching packages
    ...

def generateTop20BingResults():
    # query Bing, filter real packages out of the result pages,
    # and return the top 20 matching packages
    ...

def generateOtherMetrics(packagesFromESandBing):
    # each secondary column lists package names ordered by one metric
    names = lambda packages: [p.name for p in packages]
    downloads = names(sorted(packagesFromESandBing,
        key=lambda p: p.downloads, reverse=True))   # more downloads, better the package
    warnings = names(sorted(packagesFromESandBing,
        key=lambda p: p.warnings / p.lines))        # fewer warnings, better the package
    comments = names(sorted(packagesFromESandBing,
        key=lambda p: p.comments / p.lines, reverse=True))  # more comments, better the package
    code = names(sorted(packagesFromESandBing,
        key=lambda p: p.code / p.lines, reverse=True))      # more code, better the package
    return [downloads, warnings, comments, code]

def rerank(weightVector, matrix):
    # matrix is a list of columns; every cell holds a package name
    maxlen = max(len(column) for column in matrix)
    # start every package name appearing in any cell at score 0.0
    resultsDictionary = {name: 0.0 for column in matrix for name in column}
    # go through the matrix one row at a time
    for rowNumber in range(maxlen):
        for columnNumber, column in enumerate(matrix):
            if rowNumber < len(column):  # shorter columns have empty cells
                resultsDictionary[column[rowNumber]] += \
                    (maxlen - rowNumber) * weightVector[columnNumber]
    # the higher the score, the better the package
    resultsList = sorted(resultsDictionary,
                         key=resultsDictionary.get, reverse=True)
    # resultsList gives the order of packages;
    # resultsDictionary gives the score of each package
    return resultsList, resultsDictionary

# execution starts here
esPackages = generateTop20ESResults()      # primary column, more weight
bingPackages = generateTop20BingResults()  # primary column, more weight
columns = [[p.name for p in esPackages], [p.name for p in bingPackages]]
columns += generateOtherMetrics(esPackages + bingPackages)  # secondary columns
weightVector = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]  # can vary
rerank(weightVector, columns)
Figure 5.4: Pseudocode for ranking algorithm.
Table 5.1: Ranking matrix for keyword music.

Elasticsearch | Bing | Downloads | Warnings/Lines | Comments/Lines | Code/Lines
vk-music | musicplayer | pylast | vk-music | mps-youtube | mopidy-musicbox-webclient
jmbo-music | music | mps-youtube | jmbo-music | mps-youtube | musicplayer
panya-music | mopidy-gmusic | mps-youtube | panya-music | vis-framework | musicplayer
tweet-music | pyspotify | jmbo-music | tweet-music | pyacoustid | mopidy-gmusic
google-music-playlist-importer | musicplayer | hachoir-metadata | google-music-playlist-importer | cherrymusic | hachoir-metadata
music-score-creator | mps-youtube | pyspotify | music-score-creator | music21 | bdmusic
spilleliste | music21 | mp3play | spilleliste | music21 | music21
raspberry jam | mps-youtube | bdmusic | raspberry jam | pyspotify | music21
kurzfile | gmusicapi | pyacoustid | kurzfile | mp3play | pyspotify
vis-framework | vis-framework | gmusicapi | vis-framework | gmusicapi | mp3play
gmusic-rating-sync | music22 | vis-framework | gmusic-rating-sync | bdmusic | pyacoustid
vkmusic | pygame-music-grid | vis-framework | vkmusic | hachoir-metadata | cherrymusic
melody-dl | bdmusic | music21 | melody-dl | musicplayer | vis-framework
leftasrain | mopidy-musicbox-webclient | music21 | leftasrain | musicplayer | gmusicapi
youtubegen | netease-musicbox | raspberry jam | youtubegen | mopidy-gmusic | mps-youtube
marlib | hachoir-metadata | mopidy-musicbox-webclient | marlib | vk-music | mps-youtube
chordgenerator | mp3play | musicplayer | chordgenerator | jmbo-music | vk-music
music21 | musicplayer | tempi | panya-music | jmbo-music | raindrop.py
pyacoustid | panya-music | raindrop.py | tweet-music | panya-music | pylast
cherrymusic | cherrymusic | pylast | google-music-playlist-importer | tweet-music | tempi
- | - | kurzfile | musicplayer | music-score-creator | google-music-playlist-importer
- | - | mopidy-gmusic | musicplayer | spilleliste | music-score-creator
- | - | music-score-creator | music21 | raspberry jam | spilleliste
- | - | chordgenerator | music21 | kurzfile | raspberry jam
- | - | tweet-music | mps-youtube | vis-framework | kurzfile
- | - | vkmusic | mps-youtube | gmusic-rating-sync | vis-framework
- | - | leftasrain | pyacoustid | vkmusic | gmusic-rating-sync
- | - | gmusic-rating-sync | vis-framework | melody-dl | vkmusic
- | - | vk-music | pyspotify | leftasrain | melody-dl
- | - | tempi | mopidy-musicbox-webclient | youtubegen | leftasrain
- | - | google-music-playlist-importer | mp3play | marlib | youtubegen
- | - | marlib | hachoir-metadata | chordgenerator | marlib
- | - | melody-dl | bdmusic | tempi | chordgenerator
- | - | raindrop.py | cherrymusic | raindrop.py | tempi
- | - | spilleliste | mopidy-gmusic | pylast | raindrop.py
- | - | youtubegen | gmusicapi | mopidy-musicbox-webclient | pylast
Table 5.2: Matching packages and their scores for keyword music.

Package Name | Score | Package Name | Score
musicplayer | 45.8 | music-score-creator | 13.8
mps-youtube | 44.6 | raspberry jam | 13.6
music21 | 39 | google-music-playlist-importer | 13.5
vis-framework | 33 | kurzfile | 12.5
pyspotify | 22.8 | spilleliste | 12.1
mopidy-gmusic | 20.8 | gmusic-rating-sync | 10.8
gmusicapi | 19 | vkmusic | 10.5
bdmusic | 18.6 | music22 | 10.4
hachoir-metadata | 17.8 | pygame-music-grid | 10
jmbo-music | 17.7 | leftasrain | 9.4
mp3play | 17.1 | melody-dl | 9.3
pyacoustid | 16.9 | pylast | 9
vk-music | 15.7 | netease-musicbox | 8.8
panya-music | 15.7 | chordgenerator | 8.2
mopidy-musicbox-webclient | 15.7 | youtubegen | 8
tweet-music | 14.6 | marlib | 7.9
cherrymusic | 14.5 | tempi | 7.1
music | 14 | raindrop.py | 6.2
We limited the number of code snippets returned to 20. Sending more than 20 matches for module level search would generate a huge response and increase the response time. Before sending the top 20 results, we apply the ranking algorithm discussed for package level search, with changes to the input of the ranking function. As of now, there is only one primary column, i.e. the results from the ES server, and a list of secondary columns similar to package level search, but each one of them points to module level statistics rather than package level statistics (except for the downloads column). For example, consider the column "Warnings/Lines": for package level search, it is the ratio of the total number of warnings for the package to the total number of lines of code in the package; for module level search, it is the ratio of the total number of warnings for the module to the total number of lines of code in the module.
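As a rough sketch, the module level ranking input could be assembled as below; the attribute names here are assumptions for illustration, not the exact PyQuery data model:

    def build_module_columns(es_modules):
        # one primary column: module names in ES result order
        columns = [[m.name for m in es_modules]]
        # secondary columns mirror the package level idea, but use
        # module level statistics (downloads stays package level)
        columns.append([m.name for m in sorted(es_modules,
                        key=lambda m: m.package_downloads, reverse=True)])
        columns.append([m.name for m in sorted(es_modules,
                        key=lambda m: m.warnings / m.lines)])
        columns.append([m.name for m in sorted(es_modules,
                        key=lambda m: m.comments / m.lines, reverse=True)])
        columns.append([m.name for m in sorted(es_modules,
                        key=lambda m: m.code / m.lines, reverse=True)])
        return columns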
After rendering the matching code snippets on the front end, for user convenience and interest, we give the user the option to click on a code snippet, which opens a modal displaying the entire code of the module. This gives users more visibility into modules. Figure 5.5 shows this modal.

Figure 5.5: Module modal.
CHAPTER 6
SYSTEM LEVEL FLOW DIAGRAM
Figure 6.1 is a System Level Flow Diagram of PyQuery. A set of Python scripts is run in batch overnight to generate all the required details mentioned in the Data Collection chapter. This preprocessed information is in JSON file format. Before executing these batch scripts, the bandersnatch1 mirror client is executed so that packages are in sync with PyPI2 and we deliver the most up to date information. All the files generated are part of either package search or module search.
We maintain separate Elasticsearch (ES)3 indexes for package search and module search. These indexes are configured to update at regular intervals if there are any changes to the files they point to. They form the core of PyQuery.
A web interface customized for an easy flow of information to users is developed in Flask4 and deployed using an NGINX5 server. Figure 6.2 is PyQuery's homepage, whose design is mainly inspired by Google's homepage. It serves separate edge nodes for package search and module search. When a user hits the package level search page, an AJAX call is made to the edge node responsible for retrieving matching packages from the ES index. Based upon the matching packages retrieved from ES, a set of metrics is formulated and passed to the ranking algorithm as discussed in Chapter 5. This returns to the requesting front end page a list of packages reverse sorted by the score calculated by the ranking algorithm, so the highest scoring package is positioned on top. Figure 6.3 shows the result of Package Level Search on PyQuery for the keyword "flask".
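The sketch below suggests what such an edge node might look like in Flask; the helper functions are stubs standing in for the components described in Chapter 5, not the exact PyQuery code:

    from flask import Flask, jsonify, request

    app = Flask(__name__)
    WEIGHTS = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]  # shape borrowed from Chapter 5

    def es_search(query):
        # stub: query the package-level Elasticsearch index
        return []

    def build_columns(matches):
        # stub: assemble primary and secondary ranking columns
        return [matches]

    def rerank(weights, columns):
        # stub: the ranking function discussed in Chapter 5
        return [], {}

    @app.route('/search/packages')
    def search_packages():
        query = request.args.get('q', '')
        ranked, scores = rerank(WEIGHTS, build_columns(es_search(query)))
        return jsonify([{'name': name, 'score': scores[name]}
                        for name in ranked])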
When a user hits the module level search page, an AJAX6 call is made to the edge node responsible for retrieving matching modules or lines of code based on the metadata index in ES. After collecting the matching modules and the line numbers at which the matches occurred, the Lines of Code (LOC) technique discussed in Chapter 5 is executed to quickly capture code snippets from the matching modules and display them on the requested front end page. Figure 6.4 shows the result of Module Level Search on PyQuery for the keyword "quicksort".
1 https://pypi.python.org/pypi/bandersnatch
2 https://pypi.python.org/pypi
3 https://www.elastic.co/products/elasticsearch
4 http://flask.pocoo.org/
5 https://www.nginx.com/resources/wiki/
6 http://api.jquery.com/jquery.ajax/
Figure 6.1: System Level Flow Diagram of PyQuery.
Figure 6.2: PyQuery homepage.
Figure 6.3: PyQuery package level search template.
Figure 6.4: PyQuery module level search template.
CHAPTER 7
RESULTS
Our goal was primarily to build a better PyPI search engine. We wanted to make the search more meaningful and to avoid a huge list of closely and similarly scored packages. With PyPI being the state of the art, we compare the results of PyQuery with PyPI. For the comparison, we search 5 keywords that directly match a package name on PyPI (Table 7.1 to Table 7.5) and 5 generic keywords that infer the purpose of packages (Table 7.6 to Table 7.10).
1. The total number of results returned by PyPI was highest for the keyword "Django", amounting to 11,292, approximately 1/4 of the total number of packages on PyPI. On the other hand, the number of results from PyQuery, with two primary columns in the ranking function, is always set to a maximum of 40. In fact, on Google, 81% of users view only one results page [12], which is close to 10 results.
2. The maximum number of packages to which PyPI assigned the highest score for a keyword was 162, and that was for the keyword "flask". PyQuery always assigned the highest score to only one package.
3. For the first 5 keywords, where we expect a package to be in first position because the keyword matches the package name, PyPI showed this behavior only 2 out of 5 times. On the other hand, PyQuery exhibited this behavior 5 out of 5 times.
4. For PyPI, the distribution of scores is very narrow. Among the top 5 scores, on 4 out of 10 occasions PyPI assigned all 5 the same score, on 3 out of 10 occasions it assigned 4 of them the same score, and on 2 out of 10 occasions it assigned 3 of them the same score. PyQuery scores are more spread out.
5. For the last 5 keywords, which don't directly point to any package name but infer the need of a developer, we observe that PyQuery results are more appealing, diversely scored and unique. For example, for the keyword "web development framework", PyQuery returned all unique results, with packages like Django and Pyramid (widely used web development frameworks) in the top 5. On the other hand, among the top 5 for PyPI there were only 3 unique results, with some of them related to a web testing framework.
6. Among the top 5 results, on 4 out of 10 occasions PyPI returned duplicate package names referring to multiple versions of the same package. PyQuery always returns only one result per package, referring to the latest version.
Based on the above grounds of comparison, it is clear that we have met our goals: to improve on PyPI, offer a meaningful search and avoid closely and similarly scored packages.
Table 7.1: Results comparison for keyword - requests.

Keyword: requests
# of results from PyPI: 5117
# of results from PyQuery: 28
Top 5 results from PyPI: cache requests, curl to requests, drequests, helga-pull-requests, jsonrpc-requests
Top 5 results from PyQuery: Requests, Requests-OAuthlib, Requests-Mock, Requests-Futures, Requests-Oauth
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 131.60, 68.50, 51.00, 45.60, 39.90
# of packages with highest score - PyPI: 21
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 7 (9)
Rank (score) of expected match - PyQuery: 1 (131.60)
Table 7.2: Results comparison for keyword - flask.

Keyword: flask
# of results from PyPI: 1750
# of results from PyQuery: 37
Top 5 results from PyPI: airbrake flask, draftin a flask, fireflask, Flask, Flask-AlchemyDumps
Top 5 results from PyQuery: Flask, Flask-Admin, Flask-JSONRPC, Flask-Restless, Flask Debug-toolbar
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 46.70, 44.80, 41.20, 26.10, 25.90
# of packages with highest score - PyPI: 162
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 4 (9)
Rank (score) of expected match - PyQuery: 1 (46.70)
Table 7.3: Results comparison for keyword - pygments.

Keyword: pygments
# of results from PyPI: 250
# of results from PyQuery: 29
Top 5 results from PyPI: Pygments, django mce pygments, pygments-asl, pygments-gchangelog, pygments-rspec
Top 5 results from PyQuery: Pygments, pygments-style-github, Pygments-Xslfo-Formatter, Bibtex-Pygments-Lexer, Mistune
First 5 scores - PyPI: 11, 9, 9, 9, 9
First 5 scores - PyQuery: 88.10, 31.50, 28.50, 24.90, 19.30
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 1 (11)
Rank (score) of expected match - PyQuery: 1 (88.10)
Table 7.4: Results comparison for keyword - Django.

Keyword: Django
# of results from PyPI: 11292
# of results from PyQuery: 38
Top 5 results from PyPI: Django, django-hstore, django-modelsatts, django-notifications-hq, django-notifications-hq
Top 5 results from PyQuery: Django, Django-Appconf, Django-Celery, Django-Nose, Django-Inplaceedit
First 5 scores - PyPI: 10, 10, 10, 10, 10
First 5 scores - PyQuery: 75.30, 25.50, 25.20, 25.20, 23.50
# of packages with highest score - PyPI: 6
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 1 (10)
Rank (score) of expected match - PyQuery: 1 (73.30)
Table 7.5: Results comparison for keyword - pylint.

Keyword: pylint
# of results from PyPI: 361
# of results from PyQuery: 25
Top 5 results from PyPI: gt-pylint-commit-hook, plpylint, pylint, pylint-patcher, pylint-web2py
Top 5 results from PyQuery: Pylint, Pylint2tusar, Django-Jenkins, Pylama-pylint, Logilab-Astng
First 5 scores - PyPI: 9, 9, 9, 9, 9
First 5 scores - PyQuery: 91.90, 21.80, 20.60, 17.30, 16.00
# of packages with highest score - PyPI: 6
# of packages with highest score - PyQuery: 1
Rank (score) of expected match - PyPI: 3 (9)
Rank (score) of expected match - PyQuery: 1 (91.90)
Table 7.6: Results comparison for keyword - biological computing.

Keyword: biological computing
# of results from PyPI: 6
# of results from PyQuery: 23
Top 5 results from PyPI: blacktie, appdynamics, appdynamics, appdynamics, inspyred
Top 5 results from PyQuery: BiologicalProcessNetworks, Blacktie, PyDSTool, PySCeS, Csb
First 5 scores - PyPI: 3, 2, 2, 2, 2
First 5 scores - PyQuery: 15.10, 14.10, 12.90, 12.50, 11.80
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5
Table 7.7: Results comparison for keyword - 3D printing.

Keyword: 3D printing
# of results from PyPI: 26
# of results from PyQuery: 31
Top 5 results from PyPI: fabby, tangible, blockmodel, citygml2stl, demakein
Top 5 results from PyQuery: Pymeshio, Demakein, C3d, Bqclient, Pyautocad
First 5 scores - PyPI: 7, 7, 6, 5, 4
First 5 scores - PyQuery: 44.00, 21.50, 18.90, 18.90, 18.60
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5
Table 7.8: Results comparison for keyword - web development framework.

Keyword: web development framework
# of results from PyPI: 801
# of results from PyQuery: 32
Top 5 results from PyPI: HalWeb, WebPages, robotframeworkextendedselenium2library, robotframeworkextendedselenium2library, robotframeworkextendedselenium2library
Top 5 results from PyQuery: Django, Pyramid, Pylons, Moya, Circuits
First 5 scores - PyPI: 16, 16, 15, 15, 15
First 5 scores - PyQuery: 65.80, 48.20, 37.60, 32.60, 23.80
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5
Table 7.9: Results comparison for keyword - material science.

Keyword: material science
# of results from PyPI: 52
# of results from PyQuery: 29
Top 5 results from PyPI: py bonemat abaqus, MatMethods, MatMiner, pymatgen, pymatgen
Top 5 results from PyQuery: FiPy, Pymatgen, Pymatgen-Db, Custodian, Mpmath
First 5 scores - PyPI: 7, 6, 6, 6, 6
First 5 scores - PyQuery: 57.50, 55.20, 20.30, 19.70, 17.50
# of packages with highest score - PyPI: 1
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5 (note: 4 and 5 are duplicate results)
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4
Table 7.10: Results comparison for keyword - google maps.

Keyword: google maps
# of results from PyPI: 290
# of results from PyQuery: 31
Top 5 results from PyPI: Product.ATGoogleMaps, trytond google maps, django-google-maps, djangocms-gmaps, Flask-GoogleMaps
Top 5 results from PyQuery: Googlemaps, Django-Google-Maps, Flask-GoogleMaps, Gmaps, Geolocation-Python
First 5 scores - PyPI: 18, 18, 14, 14, 14
First 5 scores - PyQuery: 50.20, 39.40, 39.20, 39.00, 38.60
# of packages with highest score - PyPI: 2
# of packages with highest score - PyQuery: 1
Relevant packages among top 5 - PyPI: 1, 2, 3, 4, 5 (note: results are relevant to the query but are missing general purpose packages like Googlemaps among the top 5)
Relevant packages among top 5 - PyQuery: 1, 2, 3, 4, 5
CHAPTER 8
CONCLUSIONS AND FUTURE WORK
We believe we have succeeded in developing a dedicated search engine for Python packages and modules. We expect the Python community to widely adopt PyQuery. PyQuery allows Python developers to explore well written, widely adopted, popular and highly apt Python packages and modules for their programming needs. It offers itself as an encouraging tool for the Python community to follow the software engineering practice of code reuse.
8.1 Thesis Summary
In this thesis we have proposed some concrete ideas on how to develop a dedicated search engine for Python packages and modules. We have sought to build an improved version of the state of the art Python search engine, PyPI. Although PyPI is the first and only tool to address this problem, results from PyPI were found to serve user needs and requirements poorly. We have discussed various tools and techniques that are brought together as one single tool called PyQuery, facilitating better search, better ranking and better package visibility. With PyQuery we want to bridge the gap between the high demand for ways to deliver reusable components in Python for code reuse and the lack of efficient tools at users' disposal to achieve it. In Chapter 1 we discussed the relevance of this problem and our objective and approach towards solving it. In Chapter 2 we highlighted the related work in this area. For package level search, PyPI being the only search engine that does Python package search, we elaborated on how the PyPI search algorithm works and offered reasons why we think it needs improvement. For module level search, there isn't any dedicated code search engine for Python, so we explored code search engines that work across multiple languages and reasoned about the need for a dedicated search engine for Python.
PyQuery is divided into three different components: Data Collection, Data Indexing and Data Presentation. Since we intend to provide two modes of search operation, i.e. Package Level Search and Module Level Search, at each component we employ a list of tools and techniques to achieve specific goals related to these modes. In Chapter 3 we discussed the Data Collection component, the use of the Bandersnatch1 PEP 381 mirror client to clone Python packages locally, and the later processing of these packages using code analysis tools like Prospector2 and CLOC3 . We explored how to make use of Abstract Syntax Trees (ast) to filter useful information, or metadata, out of Python modules. We also addressed the JSON file format used for saving all this information, with an example for each type of data.
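As a small illustration of that metadata extraction (not the exact PyQuery code), the ast module can pull function names and line numbers out of a source file like this:

    import ast

    with open('example.py') as source:
        tree = ast.parse(source.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # every function definition carries its name and line number
            print(node.name, node.lineno)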
In Chapter 4 we demonstrated how to feed structured data to Elasticsearch (ES)4 and make use of the FS River5 and Analyzer6 plugins to digest the fed data. ES is built on top of Apache Lucene7 and offers a wide variety of methods to configure data indexing and data retrieval. We explained the purpose behind agreeing on a specific format for the JSON file that collects the data, so that we can make use of the configurations ES offers. One such configuration is the minimum fragment size. By setting the minimum fragment size to 18, and collecting each identifier together with its line number as one word, joined by an underscore and right filled with underscores up to a minimum length of 18, we were able to get a matching identifier and its line number as one single match. This drastically reduced the size of the JSON file indexed in ES and also saves the time needed to fetch the line number from another key. In this chapter, we also outlined some sample queries to index and retrieve meaningful information out of the indexed structured data.
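The packing step can be pictured with the following sketch, where the helper itself is an assumption for illustration:

    def pack_identifier(name, line_number, min_length=18):
        # join identifier and line number, then right fill with underscores
        token = '%s_%d' % (name, line_number)
        return token.ljust(min_length, '_')

    print(pack_identifier('quicksort', 42))  # quicksort_42______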
In Chapter 5, we covered data presentation concepts like the browser interface and server setup. We discussed our implementation of a server side ranking algorithm for package and module level search, the columns involved in the ranking metrics and an example view of these columns for a sample query. Also, we presented our preprocessing implementation for faster code search, which involves generating the starting byte address of each line in a module and rendering code with pygments. In Chapter 6 we gave an overview of how all three components of PyQuery work together, with a system level flow diagram. Finally, in Chapter 7 we compared results of PyQuery with those of PyPI to show that we have achieved our goals: to improve on PyPI, offer a meaningful search and avoid closely and similarly scored packages.
1 https://pypi.python.org/pypi/bandersnatch
2 https://github.com/landscapeio/prospector
3 http://cloc.sourceforge.net/
4 https://www.elastic.co/products/elasticsearch
5 https://github.com/dadoonet/fsriver
6 https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-snowball-analyzer.html
7 https://lucene.apache.org/core/
8.2 Recommendation for Future Work
Although PyQuery accomplished the initially established goals, there is definitely scope for improvement. In this section, we list ways to improve it further.
We want to perform a large scale comparison of PyQuery. Currently, we have tested PyQuery with a set of keywords for which we know the matching packages, and we observed that PyQuery is doing better than the state of the art, PyPI. Python is an extensive language, and people from many different fields use it to solve problems in their respective disciplines. In the process, useful packages are continuously being produced, and thousands of packages are quite popular for various reasons. Knowing in advance all the possible keywords that map to these packages is nearly impossible. A tool gains popularity and importance only when it is widely accepted by its user base. By reaching out to developers in the Python community from various disciplines, we can gauge how well PyQuery maps keywords to the right packages. We plan large scale user surveys that ask professional developers to search for packages they use, both by direct package name and by keywords that infer the package's purpose. We want to collect their feedback and learn whether PyQuery meets their requirements and does a better job than PyPI, and we would like to list the use cases where PyQuery needs to do better.
We can extend PyQuery into a recommendation system by applying the collaborative filtering technique, i.e., capturing user actions to learn their likes and dislikes of the Python packages we suggest, and later using this data to predict a list of packages a user would find interesting. This would allow further improvements to PyQuery. If a user trusts a specific author and tends to explore packages developed by that author more often, we could make the search results more appealing by promoting packages from this author among the set of initially matched packages. If a user tends to explore packages specific to a field or category, it is likely that he or she works in that field; if a user management component is added to PyQuery in the future, then every time a user logs in to the website we could suggest popular packages from his or her field on the dashboard, or suggest the latest news about updates to packages in that field. These are a few of the many things collaborative filtering makes possible for facilitating better search operations. It would allow developers to receive the latest information on packages and help them make the best of Python packages and modules. Many successful giants in the field of entertainment, like Netflix and Comcast, use collaborative filtering to keep their users engaged with their websites. Since PyQuery seeks to help developers explore Python packages, it could find great purpose for collaborative filtering.
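One speculative sketch of that direction follows; the data and scoring are purely illustrative:

    def recommend(user, likes, top_n=5):
        # likes maps each user to the set of packages they explored
        mine = likes[user]
        scores = {}
        for other, theirs in likes.items():
            if other == user:
                continue
            overlap = len(mine & theirs)  # shared taste between the two users
            for package in theirs - mine:
                scores[package] = scores.get(package, 0) + overlap
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    likes = {'alice': {'flask', 'requests'}, 'bob': {'flask', 'django'}}
    print(recommend('alice', likes))  # ['django']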
BIBLIOGRAPHY
[1] Caitlin Sadowski, Kathryn T Stolee, and Sebastian Elbaum. How developers search for code:
a case study. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software
Engineering, pages 191–201. ACM, 2015.
[2] Asimina Zaimi, Noni Triantafyllidou, Androklis Mavridis, Theodore Chaikalis, Ignatios Deligiannis, Panagiotis Sfetsos, and Ioannis Stamelos. An empirical study on the reuse of third-party libraries in open-source software development. In Proceedings of the 7th Balkan Conference on Informatics Conference, page 4. ACM, 2015.
[3] Andy Lynex and Paul J Layzell. Organisational considerations for software reuse. Annals of
Software Engineering, 5(1):105–124, 1998.
[4] David C Rine and Robert M Sonnemann. Investments in reusable software. a study of software
reuse investment success factors. Journal of systems and software, 41(1):17–32, 1998.
[5] Python Software Foundation. PyPI. https://pypi.python.org/pypi.
[6] Taichino. PyPI ranking. http://pypi-ranking.info/alltime, 2012.
[7] Steven P Reiss. Semantics-based code search. In Software Engineering, 2009. ICSE 2009.
IEEE 31st International Conference on, pages 243–253. IEEE, 2009.
[8] Nullege: a search engine for Python source code. http://nullege.com/.
[9] Iulian Neamtiu, Jeffrey S Foster, and Michael Hicks. Understanding source code evolution
using abstract syntax tree matching. ACM SIGSOFT Software Engineering Notes, 30(4):1–5,
2005.
[10] Themistoklis Diamantopoulos and Andreas L Symeonidis. Employing source code information
to improve question-answering in stack overflow.
[11] AST: abstract syntax trees. https://docs.python.org/2/library/ast.html.
[12] Bernard J Jansen and Amanda Spink. How are we searching the world wide web? a comparison
of nine search engine transaction logs. Information Processing & Management, 42(1):248–263,
2006.
BIOGRAPHICAL SKETCH
My name is Shiva Krishna Imminni and I was born in the metropolitan city of Hyderabad, India. My father, Mr. Nageswara Rao, is a government employee and my mother, Mrs. Subba Laxmi, is a homemaker. They are my biggest inspiration and support. I am the elder of my parents' two children. My sister Ramya Krishna Imminni is very close to my heart and is very special to me. My family is the guiding force behind the success I have in my career.
I received my Bachelor's degree from Jawaharlal Nehru Technology University in May 2011 and joined FactSet Research Systems as a QA Automation Analyst. At FactSet, I wrote QA automation scripts in various languages like Java, Ruby and JScript and worked with various automation frameworks like TestComplete and Selenium. I was one of the first three employees hired for the QA automation process, so I had many opportunities to try various job roles and experiment with new technologies. Of all the job roles, I liked training new hires the most. I was promoted to QA Automation Analyst 2 within a short span of one year and was awarded Star Performer for the year 2013. It was at FactSet that I developed Testlogger, a Ruby library to generate XML log files, custom built for QA terminology like <testcase> and <teststep>. I worked at FactSet for two years, from November 2011 to December 2013, and gained diverse experience performing various roles.
I joined the Department of Computer Science at Florida State University as a Master of Science student in Spring 2013. At FSU I continued to gain professional experience working part time as a Software Developer and Graduate Research Assistant at iDigInfo, the Institute of Digital Information and Scientific Communication. At iDigInfo, I worked on various projects related to research in specimen digitization. These projects include Morphbank, a continuously growing database of images that scientists use for international collaboration, research and education, and iDigInfo-OCR, optical character recognition software for digitizing the label information of specimen collections. I also worked as a Graduate Teaching Assistant for a Bachelor level Software Engineering course. As part of my coursework, I took a Python course under Dr. Piyush Kumar that led to my interest in working on PyQuery. The experience I gained while working on PyQuery helped me get an internship opportunity: during the Summer of 2015, I interned with Bank of America, where I worked on various technologies related to Big Data, including Hadoop HDFS, Hive and Impala.