Launching Rogo on GitHub Data

Tl;dr: We scraped data from GitHub with info on commits, star history, forks, watchers, languages, and a TON more. Today we’re launching a public version of Rogo on this dataset. All the latest info on trending repos, topics, authors, and more is just seconds away at tryrogo.com!

Published on May 17, 2023 by John Willett

Background on Rogo

If there is one challenge that companies of all shapes, sizes, and industries face, it’s making full use of their data. The majority of decision-makers still can’t access and analyze it.

AI-enabled natural language interfaces (NLIs) have rightly started to dominate how we think about tackling this challenge. The problem is that these tools, which typically use GPT-4 to translate plain English straight into SQL, are not yet a good fit for non-technical people. They work best when the user has a super-specific data question, which typically requires knowing not only how the data is structured but also roughly how the SQL query should look. This is not the case for most users. Even worse, these models often produce wrong answers, and the non-technical user has a hard time fixing the query or (worst of all) even recognizing that it is wrong.

At Rogo, we’re obsessed with the single goal of making our platform the easiest way for non-technical people to analyze data—and to do so accurately and confidently. We’re building our AI-enabled parsing systems with the challenges listed above top of mind. Rogo has guardrails so that users can always tell what they’re looking at and whether it’s correct. It employs a constructive, flexible approach to parsing that helps users who don’t know much about the underlying data’s structure or who don’t know precisely what they want to ask.

To learn more about our approach to AI and parsing, read here.

Our GitHub dataset

The first public dataset we are connecting to Rogo is repo-level data from GitHub, including commits, star history, forks, watchers, languages, and a ton more. We created it ourselves, and we’re making it available to the public for free!

Here’s the link: tryrogo.com.

This dataset contains three tables: Repos, Repository Topics, and Star History. The Repos table covers ~20,000 repos created between 2008 and 2023. For each one it records the number of stars, watchers, forks, and open issues, along with the commit cadence, the license, and the primary language. It also tracks changes in star counts over 30- and 90-day periods. The “Rogo Score” measures how healthy a repo's community is, as a way to filter out fads. It is a normalized linear combination of commits, stars, star growth, commit cadence, and forks.

The topics table relates each repository to one or more topics, indicating the specific areas of interest that each repository is associated with. (Note that topics are different from tags.)

Lastly, the star history table, which has over 13 million rows, documents the historical evolution of star counts for projects from 2022 to 2023. It provides a daily record of the number of stars each project has received, offering a granular look at how community interest in each project has developed over time.

Here is a closer look at the dataset:


Repos
  • ID: Unique identifier of the repository.
  • Name: Name of the repository.
  • Language: The primary language used in the repository.
  • URL: The URL to the repository on GitHub.
  • Description: Description of the repository.
  • Stars: The total count of stars given to the repository.
  • Watchers: The total count of users watching the repository.
  • Forks: The total count of forks of the repository.
  • Open Issues Count: The total count of open issues in the repository.
  • License: The license under which the repository is released.
  • Topic ID: Identifier for the topic associated with the repository.
  • Repo Name: The name of the repository.
  • Author Name: The name of the repository's author.
  • Commits: The total number of commits made in the repository.
  • Created At: The date when the repository was created.
  • Commit Cadence: This is the ratio of the number of commits to the number of days since the repository was created.
  • Star Growth: The growth rate of stars over time.
  • 90 Day Delta: This is the absolute change in the number of stars over the past 90 days.
  • 30 Day Delta: This is the absolute change in the number of stars over the past 30 days.
  • Rogo Score: A score for ranking repositories, inspired by the work Two Sigma did on tracking open source repos. It's a normalized linear combination calculated using the following equation:
    Rogo Score = (0.15 * commits + 0.30 * stars + 0.25 * star_growth_delta + 0.25 * commit_cadence + 10 * forks) / sum(ts_score)
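
As a rough illustration of the equation above, here is a minimal Python sketch. The `Repo` class and its field names are hypothetical, and two readings are assumptions rather than anything confirmed by the formula: that star_growth_delta is the 90-day star delta, and that sum(ts_score) is the sum of the unnormalized scores across all repos. It is not Rogo's actual implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Repo:
    # Hypothetical record; field names are illustrative only.
    name: str
    commits: int
    stars: int
    star_growth_delta: int  # assumed here to be the 90-day star delta
    forks: int
    created_at: date


def commit_cadence(repo: Repo, today: date) -> float:
    """Commits per day since the repository was created."""
    days = max((today - repo.created_at).days, 1)  # avoid division by zero
    return repo.commits / days


def raw_score(repo: Repo, today: date) -> float:
    """Weighted sum from the equation above, before normalization."""
    return (
        0.15 * repo.commits
        + 0.30 * repo.stars
        + 0.25 * repo.star_growth_delta
        + 0.25 * commit_cadence(repo, today)
        + 10 * repo.forks
    )


def rogo_scores(repos: list[Repo], today: date) -> dict[str, float]:
    """Divide each raw score by the sum of all raw scores (assumed normalization)."""
    raw = {r.name: raw_score(r, today) for r in repos}
    total = sum(raw.values())
    return {name: (score / total if total else 0.0) for name, score in raw.items()}
```

Under this reading, the scores across the whole table sum to 1, so a repo's score is its share of the total.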

Topics
  • Topic ID: Unique identifier for the topic.
  • Repo ID: Unique identifier of the repository associated with the topic.
  • Topic Name: Name of the topic associated with the repository.

Star History
  • Star History ID: Unique identifier for the star history record.
  • Project ID: Identifier of the project associated with the star history.
  • Event Date: The date on which the star event occurred.
  • Stars: The total count of stars at the time of the event.
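
To show how the 30 Day Delta and 90 Day Delta fields could relate to the Star History table, here is a small Python sketch. It assumes one cumulative star count per project per day. The function name and the dictionary layout are hypothetical, and this is not necessarily how Rogo computes these fields.

```python
from datetime import date, timedelta
from typing import Optional


def star_delta(history: dict[date, int], as_of: date, window_days: int) -> Optional[int]:
    """Absolute change in total stars over the last `window_days` days.

    `history` maps a date to the cumulative star count on that date.
    Returns None if either endpoint is missing from the history.
    """
    start = as_of - timedelta(days=window_days)
    if as_of not in history or start not in history:
        return None
    return history[as_of] - history[start]
```

Calling `star_delta(history, as_of, 30)` and `star_delta(history, as_of, 90)` on the same project's history would give the two delta columns.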

Analyzing GitHub data in Rogo


Rogo is insanely easy to use. All you do is type your questions into the search bar.

Let’s say you’re interested in open-source projects from OpenAI. (Who isn’t?) Just type: "What are some trending repos from OpenAI?"


That’s it! Those options at the bottom let you view the underlying SQL, as well as the intermediate representation Rogo mapped your query to. Note that these are not always available, depending on how Rogo parsed your query.

If Rogo got it right, give it a thumbs up! If not, give it a thumbs down.

Let’s keep going. Wondering how OpenAI’s open-source virality as a whole has looked recently? Just type: “Show me stars for Open AI over time.”


Maybe you’re just breaking into the open-source AI world and you’re wondering what languages to brush up on. Let’s go ahead and type: “What languages are getting used the most for AI repos?”

This question is a bit vague: How should we measure this? Number of repos using each language? Number of popular repos using each language? We can check the SQL that Rogo generated to see that it’s using the total number of stars as the metric.
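
For a sense of what a query with that metric might look like, here is a sketch using Python's built-in sqlite3 module. The table and column names, the 'ai' topic filter, and the sample rows are all made up for illustration. This is not the SQL Rogo actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema with made-up sample rows (not real GitHub data).
conn.executescript("""
    CREATE TABLE repos (id INTEGER PRIMARY KEY, name TEXT, language TEXT, stars INTEGER);
    CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, repo_id INTEGER, topic_name TEXT);
    INSERT INTO repos VALUES (1, 'alpha', 'Python', 500), (2, 'beta', 'Rust', 300),
                             (3, 'gamma', 'Python', 200), (4, 'delta', 'Go', 900);
    INSERT INTO topics VALUES (1, 1, 'ai'), (2, 2, 'ai'), (3, 3, 'ai'), (4, 4, 'devops');
""")

# Rank languages by the total stars of repos tagged with the (assumed) 'ai' topic.
QUERY = """
    SELECT r.language, SUM(r.stars) AS total_stars
    FROM repos AS r
    JOIN topics AS t ON t.repo_id = r.id
    WHERE t.topic_name = 'ai'
    GROUP BY r.language
    ORDER BY total_stars DESC
"""

rows = conn.execute(QUERY).fetchall()
```

Swapping `SUM(r.stars)` for `COUNT(*)` would answer the "number of repos using each language" reading of the same question instead.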


What else is hot besides AI? Let’s check out what topics are trending overall. Again, this question is a bit vague. Rogo returns multiple results to let you choose how you want to measure popularity. The first chart uses star growth, and the second uses the overall number of repositories that have grown in the last 30 days.
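
The two measures can rank topics differently. Here is a small Python sketch of both, using made-up sample rows and hypothetical variable names. It only illustrates the idea, not Rogo's internals.

```python
from collections import defaultdict

# Made-up sample rows: (topic_name, repo_name, 30-day star delta).
rows = [
    ("ai", "alpha", 120),
    ("ai", "beta", 0),
    ("cli", "gamma", 40),
    ("cli", "delta", 35),
    ("cli", "epsilon", 10),
]

star_growth = defaultdict(int)    # metric 1: total 30-day star growth per topic
growing_repos = defaultdict(int)  # metric 2: number of repos that gained stars

for topic, _repo, delta_30 in rows:
    star_growth[topic] += delta_30
    if delta_30 > 0:
        growing_repos[topic] += 1

# Topics ordered from most to least popular under each metric.
by_growth = sorted(star_growth, key=star_growth.get, reverse=True)
by_repo_count = sorted(growing_repos, key=growing_repos.get, reverse=True)
```

In this toy data, one fast-growing repo puts "ai" first by star growth, while "cli" comes first by number of growing repos.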

It's that easy. And this is just the tip of the iceberg!

We’re not done


There are still a million things we’re doing to make Rogo better. A big focus is speed: Right now, results can be slow to come back. We’re also adding more GitHub data, as well as other datasets (like public financials).

Please give us feedback on the platform—big or small! Email team@rogodata.com with any thoughts, questions, or comments.

Most importantly, if you want to put Rogo on your own data, email us at team@rogodata.com.
