The first public dataset we are connecting to Rogo is repo-level data from GitHub, including commits, star history, forks, watchers, languages, and a ton more. We created it ourselves, and we’re making it available to the public for free!
Here’s the link: tryrogo.com.
This dataset contains three tables: Repos, Repository Topics, and Star History. The Repos table covers ~20,000 repos created between 2008 and 2023. For each one it records the number of stars, watchers, forks, and open issues, the commit cadence, and the license and primary language. This table also tracks changes in repository activity over 30- and 90-day periods. The “Rogo score” measures how healthy a repo’s community is, as a way to filter out fads. It is a normalized linear combination of commits, stars, 90-day star delta, commit cadence, and forks.
The topics table relates each repository to one or more topics, indicating the specific areas of interest that each repository is associated with. (Note that topics are different from tags.)
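As a rough illustration of how the topics table relates to the repos table, here is a minimal sketch that groups repository names by topic. The sample rows and the field names (`id`, `repo_id`, `topic_name`) are hypothetical stand-ins modeled on the columns described in this post, not the dataset's exact schema.

```python
from collections import defaultdict

# Hypothetical sample rows; the real column names may differ.
repos = [
    {"id": 1, "name": "alpha", "stars": 1200},
    {"id": 2, "name": "beta", "stars": 300},
]
repo_topics = [
    {"topic_id": 10, "repo_id": 1, "topic_name": "machine-learning"},
    {"topic_id": 11, "repo_id": 1, "topic_name": "python"},
    {"topic_id": 12, "repo_id": 2, "topic_name": "python"},
]


def repos_by_topic(repos, repo_topics):
    """Group repository names under each topic name, joined on the repo ID."""
    names = {repo["id"]: repo["name"] for repo in repos}
    grouped = defaultdict(list)
    for row in repo_topics:
        # Skip topic rows that point at a repo we do not have.
        if row["repo_id"] in names:
            grouped[row["topic_name"]].append(names[row["repo_id"]])
    return dict(grouped)


topics = repos_by_topic(repos, repo_topics)
```

Because one repository can carry several topics, the same repo name may appear under more than one key.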
Lastly, the star history table, which has over 13 million rows, documents the historical evolution of star counts for projects from 2022 to 2023. It provides a daily record of the number of stars each project has received, offering a granular look at how community interest in each project has developed over time.
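Since the star history is a daily record of star counts, a 30- or 90-day delta can be derived from it with a simple lookup. The sketch below assumes the history for one project has already been loaded into a dictionary keyed by date; the function name `star_delta` and the sample data are made up for illustration.

```python
from datetime import date, timedelta


def star_delta(history, as_of, days):
    """Absolute change in stars over the `days` days before `as_of`.

    `history` maps a date to the total star count on that date.
    Returns None when either endpoint is missing from the record.
    """
    start = as_of - timedelta(days=days)
    if as_of not in history or start not in history:
        return None
    return history[as_of] - history[start]


# Hypothetical project that gains two stars per day for 120 days.
first_day = date(2023, 1, 1)
history = {first_day + timedelta(days=i): 100 + 2 * i for i in range(120)}
latest = first_day + timedelta(days=119)

delta_30 = star_delta(history, latest, 30)
delta_90 = star_delta(history, latest, 90)
```

Returning `None` for a missing endpoint keeps gaps in the record visible instead of silently treating them as zero growth.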
Here is a closer look at the dataset:
Repos table:
- ID: Unique identifier of the repository.
- Name: Name of the repository.
- Language: The primary language used in the repository.
- URL: The URL to the repository on GitHub.
- Description: Description of the repository.
- Stars: The total count of stars given to the repository.
- Watchers: The total count of users watching the repository.
- Forks: The total count of forks of the repository.
- Open Issues Count: The total count of open issues in the repository.
- License: The license under which the repository is released.
- Topic ID: Identifier for the topic associated with the repository.
- Repo Name: The name of the repository.
- Author Name: The name of the repository's author.
- Commits: The total number of commits made in the repository.
- Created At: The date when the repository was created.
- Commit Cadence: This is the ratio of the number of commits to the number of days since the repository was created.
- Star Growth: The growth rate of stars over time.
- 90 Day Delta: This is the absolute change in the number of stars over the past 90 days.
- 30 Day Delta: This is the absolute change in the number of stars over the past 30 days.
- Rogo Score: A score for ranking repositories, inspired by the work Two Sigma did on tracking open source repos. It's a normalized linear combination calculated using the following equation:
Rogo Score = (0.15 * commits + 0.30 * stars + 0.25 * star_growth_delta + 0.25 * commit_cadence + 10 * forks) / sum(ts_score)
Repository Topics table:
- Topic ID: Unique identifier for the topic.
- Repo ID: Unique identifier of the repository associated with the topic.
- Topic Name: Name of the topic associated with the repository.
Star History table:
- Star History ID: Unique identifier for the star history record.
- Project ID: Identifier of the project associated with the star history.
- Event Date: The date on which the star event occurred.
- Stars: The total count of stars at the time of the event.
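To make the Commit Cadence and Rogo Score definitions above more concrete, here is a minimal sketch. The weights are copied verbatim from the equation in this post. One assumption to flag: `sum(ts_score)` is read here as the sum of the unnormalized weighted scores across all repos, so that the resulting scores add up to 1. The function names and sample repos are hypothetical.

```python
from datetime import date

# Weights copied verbatim from the equation in this post.
WEIGHTS = {
    "commits": 0.15,
    "stars": 0.30,
    "star_growth_delta": 0.25,
    "commit_cadence": 0.25,
    "forks": 10,
}


def commit_cadence(commits, created_at, today):
    """Commits per day since the repository was created."""
    age_days = (today - created_at).days
    return commits / age_days if age_days > 0 else 0.0


def raw_score(repo):
    """Unnormalized weighted sum of the five inputs."""
    return sum(weight * repo[field] for field, weight in WEIGHTS.items())


def rogo_scores(repos):
    """Divide each raw score by the total over all repos.

    Assumption: sum(ts_score) is the sum of raw scores across repos.
    """
    raw = {repo["name"]: raw_score(repo) for repo in repos}
    total = sum(raw.values())
    return {name: (value / total if total else 0.0) for name, value in raw.items()}


# Hypothetical sample repos.
today = date(2023, 6, 1)
sample = [
    {"name": "alpha", "commits": 730, "stars": 1000, "star_growth_delta": 100,
     "forks": 50, "commit_cadence": commit_cadence(730, date(2021, 6, 1), today)},
    {"name": "beta", "commits": 100, "stars": 200, "star_growth_delta": 10,
     "forks": 5, "commit_cadence": commit_cadence(100, date(2022, 6, 1), today)},
]
scores = rogo_scores(sample)
```

With this reading of the normalization, a repo's score is its share of the total across the dataset, so it is only meaningful relative to the other repos scored alongside it.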